Online Markov Decision Processes Under Bandit Feedback
Authors
Abstract
Similar Resources
Online Markov Decision Processes
We consider a Markov decision process (MDP) setting in which the reward function is allowed to change after each time step (possibly in an adversarial manner), yet the dynamics remain fixed. Similar to the experts setting, we address the question of how well an agent can do when compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which ha...
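For concreteness, the benchmark implicit in this comparison is regret against the best stationary policy in hindsight; a standard way to write it (the notation below is an assumption for illustration, not quoted from the paper) is:

```latex
% Regret after T steps: the best stationary policy's expected cumulative
% reward minus the learner's, with the reward r_t changing each step
% while the transition dynamics stay fixed.
R_T \;=\; \max_{\pi \in \Pi}\; \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\bigl(s_t^{\pi}, \pi(s_t^{\pi})\bigr)\right]
\;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} r_t(s_t, a_t)\right]
```

where $s_t^{\pi}$ denotes the state trajectory induced by running $\pi$ throughout, and a sublinear $R_T$ means the learner's average reward approaches that of the best fixed policy.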
Online Stochastic Optimization under Correlated Bandit Feedback
In this paper we consider the problem of online stochastic optimization of a locally smooth function under bandit feedback. We introduce the high-confidence tree (HCT) algorithm, a novel anytime X-armed bandit algorithm, and derive regret bounds matching the performance of the existing state of the art in terms of dependence on the number of steps and the smoothness factor. The main advantage of HCT is t...
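To make the tree-based idea concrete, here is a minimal Python sketch of optimistic search over a hierarchical partition of a one-dimensional arm space, in the spirit of HOO/HCT-style X-armed bandits. It is not the HCT algorithm itself; the splitting rule, bonus terms, and constants are illustrative assumptions.

```python
import math
import random

# Sketch: UCB over a hierarchical partition of [0, 1]. Descend to the
# leaf cell with the highest optimistic value, sample inside it, and
# split cells as they accumulate pulls. Constants are illustrative.

class Cell:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.n, self.total = 0, 0.0     # pull count and reward sum
        self.children = None

    def ucb(self, t):
        if self.n == 0:
            return float("inf")         # unexplored cells are optimistic
        bonus = math.sqrt(2 * math.log(t) / self.n)
        return self.total / self.n + bonus + (self.hi - self.lo)  # crude smoothness term

def step(f, root, t, split_after=10):
    node = root
    while node.children:                # descend along max-UCB children
        node = max(node.children, key=lambda c: c.ucb(t))
    x = random.uniform(node.lo, node.hi)
    r = f(x)                            # noisy reward from the chosen cell
    node.n += 1
    node.total += r
    if node.n >= split_after:           # refine the partition at this leaf
        mid = (node.lo + node.hi) / 2
        node.children = [Cell(node.lo, mid), Cell(mid, node.hi)]
    return x, r

# usage: maximize a noisy, locally smooth function
f = lambda x: -(x - 0.3) ** 2 + random.gauss(0.0, 0.1)
root = Cell(0.0, 1.0)
for t in range(1, 2001):
    step(f, root, t)
```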
A Multi-Armed Bandit Approach for Online Expert Selection in Markov Decision Processes
We formulate a multi-armed bandit (MAB) approach to choosing expert policies online in Markov decision processes (MDPs). Given a set of expert policies trained on a state and action space, the goal is to maximize the cumulative reward of our agent. The hope is to quickly find the best expert in our set. The MAB formulation allows us to quantify the performance of an algorithm in terms of the re...
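A minimal sketch of the arms-as-policies reduction this abstract describes, with each expert policy treated as a bandit arm and standard UCB1 run over episodic returns; `run_episode` is a hypothetical helper (one rollout of the MDP under a policy, return scaled to [0, 1]), not an interface from the paper.

```python
import math

# Sketch: treat each expert policy as an arm and run UCB1 over episodic
# returns. run_episode(policy) is an assumed helper that rolls out one
# episode of the MDP under `policy` and returns its return in [0, 1].

def select_expert_ucb(policies, run_episode, horizon=1000):
    n = [0] * len(policies)          # pulls per expert
    s = [0.0] * len(policies)        # summed returns per expert
    for t in range(1, horizon + 1):
        if t <= len(policies):       # pull each expert once to initialize
            i = t - 1
        else:                        # then pick the highest upper bound
            i = max(range(len(policies)),
                    key=lambda k: s[k] / n[k] + math.sqrt(2 * math.log(t) / n[k]))
        r = run_episode(policies[i])
        n[i] += 1
        s[i] += r
    return max(range(len(policies)), key=lambda k: s[k] / n[k])
```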
PAC Bounds for Multi-armed Bandit and Markov Decision Processes
The bandit problem is revisited and considered under the PAC model. Our main contribution in this part is to show that, given $n$ arms, it suffices to pull the arms $O\!\left(\frac{n}{\epsilon^2}\log\frac{1}{\delta}\right)$ times to find an $\epsilon$-optimal arm with probability at least $1-\delta$. This is in contrast to the naive bound of $O\!\left(\frac{n}{\epsilon^2}\log\frac{n}{\delta}\right)$. We derive another algorithm whose complexity depends on the specific setting of the rewards,...
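The gap between the two bounds is exactly the union-bound factor, as the following standard Hoeffding calculation shows (rewards assumed to lie in $[0,1]$; this reasoning is classical rather than specific to the paper). Sampling arm $i$ for $m$ rounds gives

```latex
\Pr\!\left(\,|\hat{\mu}_i - \mu_i| > \tfrac{\epsilon}{2}\,\right) \;\le\; 2e^{-m\epsilon^2/2},
\qquad\text{so}\qquad
m = O\!\left(\tfrac{1}{\epsilon^2}\log\tfrac{n}{\delta}\right)
```

suffices to keep each arm's failure probability below $\delta/n$, for a naive total of $n \cdot m = O\!\left(\frac{n}{\epsilon^2}\log\frac{n}{\delta}\right)$ pulls; median-elimination-style algorithms avoid the union bound over all $n$ arms and thereby drop the $\log n$, recovering $O\!\left(\frac{n}{\epsilon^2}\log\frac{1}{\delta}\right)$.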
Online Markov decision processes with policy iteration
The online Markov decision process (MDP) is a generalization of the classical Markov decision process that incorporates changing reward functions. In this paper, we propose practical online MDP algorithms with policy iteration and theoretically establish a sublinear regret bound. A notable advantage of the proposed algorithm is that it can be easily combined with function approximation, and thu...
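As a rough illustration of what a policy-iteration-style online MDP update can look like, here is a Python sketch of one round: evaluate the current stochastic policy against the just-revealed reward function, then take a softmax (multiplicative-weights) improvement step. The notation, step size, and update rule are assumptions for illustration, not the algorithm from this paper.

```python
import numpy as np

# One round of a generic online-MDP loop with a policy-iteration flavour.
# Assumed notation: P is the fixed transition kernel [S, A, S], r_t the
# round-t reward matrix [S, A], pi a stochastic policy [S, A].

def policy_eval(pi, P, r, gamma=0.9):
    """Return Q^pi for fixed transitions P and this round's reward r."""
    S = P.shape[0]
    r_pi = (pi * r).sum(axis=1)                      # expected reward per state
    P_pi = np.einsum("sa,sat->st", pi, P)            # state-to-state kernel under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return r + gamma * P.dot(v)                      # Q-values, shape [S, A]

def online_pi_round(pi, P, r_t, eta=0.1, gamma=0.9):
    """Evaluate the current policy, then a softmax improvement step."""
    q = policy_eval(pi, P, r_t, gamma)
    w = pi * np.exp(eta * q)                         # multiplicative weights
    return w / w.sum(axis=1, keepdims=True)          # renormalize per state

# usage sketch: S=3 states, A=2 actions, random fixed dynamics,
# reward matrix redrawn each round to mimic changing rewards
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))           # [S, A, S]
pi = np.full((3, 2), 0.5)
for _ in range(100):
    pi = online_pi_round(pi, P, rng.random((3, 2)))
```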
Journal
Journal Title: IEEE Transactions on Automatic Control
Year: 2014
ISSN: 0018-9286, 1558-2523
DOI: 10.1109/tac.2013.2292137